Document copying deterrent method.

The present invention is directed to a method of deterring the illicit copying of electronically published
documents. It includes utilizing a computer system to electronically publish a plurality of copies of a document
having electronically created material thereon for distribution to a plurality of subscribers and operating
programming within the computer system so as to perform the identification code functions. The steps are to
encode the plurality of copies each with a separate, unique identification code, the identification code being
based on a unique arrangement of the electronically created material on each such copy; and to create a
codebook to correlate each such identification code to a particular subscriber. In some embodiments, decoding
methods are included with the encoding capabilities. The unique arrangement of the electronically created
material may be based on line-shift coding, word-shift coding, or feature enhancement coding (or combinations
of these) and may be effected through bitmap alteration or document format file alteration.
Field of the Invention

The present invention involves methods of deterring illicit copying of electronically published documents
by creating unique identification codes specific to each subscriber. Each copy of the published
document has a unique arrangement of electronically created material, e.g. print material or display
material, which is not quickly discernible to the untrained human eye. These unique identification codes
discourage illicit copying and enable a publisher/copyright owner to analyze illicit copies to determine the
source subscriber.

Detailed Background

When the quality of reproductions from copy machines became comparable with the original, the cost
of copies was reduced to a few pennies per page, and the time it took to copy a page was reduced to a
second or less, then copy machines started to present a threat to publishers. The problem is intensified in
the electronic domain. The quality of a reproduction is identical with the original, there is almost no cost
associated with making the copy, and with a single keystroke, hundreds of pages can be copied in a
fraction of a second. In addition, electronic documents can be distributed to large groups, by electronic mail
or network news services, with almost no effort on the part of the sender.

The ability to easily and inexpensively copy and distribute electronic documents is considered to be the
main technical problem that must be overcome before electronic publishing can become a viable alternative
to conventional publishing. Preventing an individual from duplicating a file of data that is in his possession is
an extremely difficult, if not impossible, task. Instead of trying to prevent duplication of general data files, the
present invention is directed to making electronic publishing more acceptable by making it possible to
identify the original owner of a bitmap version of the text portion of a document. With the current copyright
laws, the present invention should be adequate to discourage much of the copying and distribution that
might otherwise occur. An interesting result of the present invention method is that a publisher or copyright
owner can also determine who the original belonged to when reproduced copies are found.

Summary of The Invention 

The present invention is directed to a method of deterring the illicit copying of electronically published
documents. It includes utilizing a computer system to electronically publish a plurality of copies of a
document having electronically created material thereon for distribution to a plurality of subscribers and
operating programming within the computer system so as to perform the identification code functions. The
steps are to encode the plurality of copies each with a separate, unique identification code, the identification
code being based on a unique arrangement of the electronically created material on each such copy; and
to create a codebook to correlate each such identification code to a particular subscriber. In some
embodiments, decoding methods are included with the encoding capabilities. The unique arrangement of
the electronically created material may be based on line-shift coding, word-shift coding, or feature
enhancement coding (or combinations of these) and may be effected through bitmap alteration or document
format file alteration.

Brief Description of The Drawings 

The present invention is more fully understood when the present invention specification herein is taken
in conjunction with the drawings appended hereto, wherein:

Figure 1 illustrates a flow diagram of an overview of preferred embodiments of the present invention
methods;

Figure 2 illustrates a flow diagram of an encoder operation in a present invention method;

Figure 3 illustrates a flow diagram of a decoder operation in a present invention method;

Figure 4 shows pseudocode for simple line spacing encoder operations for PostScript files;

Figure 5 shows a profile of a recovered document using text line-shift encoding;

Figure 6 illustrates three examples of feature enhancing in a 5 x 5 pixel array;

Figure 7 illustrates line-shift encoding with line space measurements shown qualitatively;

Figure 8 shows word-shift encoding with vertical lines to emphasize normal and shifted word spacing;

Figure 9 illustrates the same text as in Figure 8 but without vertical lines to demonstrate that both
unshifted and shifted word spacing appears natural to the untrained eye;

Figure 10 shows an example of text of a document with no feature enhancement;

Figure 11 shows the Figure 10 text with feature enhancement;

Figure 12 illustrates the Figure 11 text with the same features enhanced with exaggeration;

Figure 13 shows a comparison of baseline and centroid detection results as line spacing and font size
are varied;

Figure 14 shows a comparison of baseline and centroid detection as a text page is recursively copied.
The results are for 10 point font size with a single pixel spacing; and

Figure 15 shows a schematic diagram for a noise accumulation model.

Detailed Description of The Present Invention 

One object and general purpose of techniques of the present invention is to provide a means of
discouraging the illegitimate copying and dissemination of documents. In the present invention methods,
document marking embeds a unique identification code within each copy of a document to be distributed,
and a codebook correlating the identification code to a particular subscriber (recipient) is maintained.
Hence, examination of a recovered document (or in certain cases, a copy of a distributed document) reveals
the identity of the original document recipient.

Document marking can be achieved by either altering text formatting, i.e. lines, words, or groups of
characters, or by altering certain characteristics of textual elements (e.g. altering individual characters). The
alterations used in marking a document in the present invention method enable the publisher to:
(1.) embed a codeword that can be identified for security (traceability) purposes, and
(2.) alter features with as little visible change of appearance as possible.

Certain types of markings can be detected in the presence of noise, which may be introduced in
documents by printing, scanning, plain paper copying, etc.

"Encoded" documents using the present invention methods can provide security in several possible
ways, including the following:

(1.) A document can be coded specifically for each site, subscriber, recipient, or user (hereinafter
referred to as "subscriber"). Then, any dissemination of an encoded document outside of the intended
subscriber may be traced back to the intended subscriber.

(2.) A document code can mark a document as legitimately matched to a specific installation of a user
interface (e.g. a particular subscriber computer workstation). If an attempt is made to display a document
unmatched to this interface, then that interface can be configured in such a way as to refuse display of
the document.



1.0 Overview of Applications 
An overview of document production, distribution, and user interaction according to the present
invention is illustrated in Figure 1. This shows three paths a document can follow from the publisher 3 to a
user. The first is the regular paper copy distribution channel 11 (i.e. a user receives a paper journal, etc.
from the publisher). The second and third paths are electronic dissemination 13 via document database 21
and electronic document interface 23, for user display 15 or through a user printer 17 to create a printed
document. Whether from the paper copy distribution channel 11 or from the user printer 17, plain paper
copier 27, for example, may then be used to create illicit paper copy 29. Variations could, of course, be
made to the flow chart of Figure 1 without exceeding the scope of the present invention. For example, an
illicit user could scan a legal version with a scanner and then electronically reproduce illicit copies. The
present invention methods apply to documents distributed along any of these or similar types of
distribution paths, e.g. published electronically and distributed via fax, via radio communication computer,
etc. Document coding is performed prior to document dissemination as indicated by encoder 9.

Documents are encoded while still in electronic form (Figure 2). The techniques to encode documents
may be used in either of the two following forms: images or formatted document files. The image
representation describes each page (or sub-page) of a document as an array of pixels. The image may be
black and white (also called bitmap), gray-scale, or color. In the remainder of this text, the image
representation is simply referred to as a "bitmap", regardless of the image color content. The formatted
document file representation is a computer file describing the document content using such standard format
description languages as PostScript, troff, SGML, etc.

In a typical application, a bitmap is generated from a formatted document file. The coding technique(s)
used in the present invention to mark a document will depend in part on the original format supplied to the
encoder, and the format that the subscriber sees. It is assumed that once a subscriber sees a document
(e.g. displays a page on a workstation monitor), then he or she can capture and illegitimately disseminate
that document. Therefore, coding must be embedded before this subscriber stage. Thus, as in Figure 2, the
electronic document 31 is encoded at encoder 33 according to a preselected set of alterations set up in
codebook 35. (It is not essential that the codebook predate the encoding; in some embodiments, the
codebook may be created from logging of identification codes as used, to correlate these to specific
subscribers, or vice versa.) The encoded documents are each uniquely created as version 1 (37), version 2
(39), version 3 (41), version 4 (43), and so on through version N (45).

A variety of encoding techniques may be used in the present invention methods. These relate to
altering lines, words or character features (or combinations) without the need to add textual, graphical,
alphabetical, numerical or other unique identifiers, and thereby do not alert an illicit copier to the code. Thus,
common to all methods is that the codeword is embedded in the document by altering particular aspects of
already existing features. For instance, consider the codeword 1101 (binary). Reading this code right to left
from the least significant bit, the first document feature is altered for bit 1, the second feature is not altered
for bit 0, and the next two features are altered for the two 1 bits. It is the type of feature that distinguishes
each particular encoding method:

(1.) Line-Shift Coding: a method of altering the document format file by shifting the locations of text-lines
to uniquely encode the document. This code may be decoded from the format file or bitmap. Lines may
be dithered horizontally or vertically, for example. The method provides the highest reliability among
these methods for detection of the code even in images degraded by noise.

(2.) Feature-Enhancement Coding: a method of altering a document bitmap image by modifying certain
textual element features to uniquely encode the document. One example of such a modification is to
extend the length of character ascenders. Another is to narrow character width; another is to remove or
shorten a character section. This type of code is encoded and decoded from the bitmap image.

(3.) Word-Shift Coding: a method of altering the document format file or image bitmap by shifting the
locations of words within the text to uniquely encode the document. This coding may be decoded from
the format file or bitmap. This method, in preferred embodiments using document format file alteration, is
similar in use to method (1). It typically provides less visible alteration of the document than method (1),
but decoding from a noisy image may be less easily performed.
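
The bit-reading order described for the codeword 1101 can be illustrated with a short sketch. This is only an illustration under stated assumptions: the function name `features_to_alter` and the idea of passing the number of features as a parameter are hypothetical and do not come from the specification.

```python
def features_to_alter(codeword, n_features):
    """Return one flag per document feature, in document order.

    True means "alter this feature", False means "leave it as is".
    Bits are consumed from the least significant end, as in the text.
    (Hypothetical helper, for illustration only.)
    """
    flags = []
    for _ in range(n_features):
        flags.append(bool(codeword & 1))  # mask off the least significant bit
        codeword >>= 1                    # right-shift to expose the next bit
    return flags


# Codeword 1101 (binary): first feature altered, second not, next two altered.
print(features_to_alter(0b1101, 4))  # [True, False, True, True]
```

The same mask-and-shift loop applies whichever kind of feature (line, word space, or character feature) is being altered.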

A detailed discussion will now follow regarding each of the above three encoding techniques. 
2.1 Text-Line Coding

This is a coding method that is applied to a formatted document file. In the following discussion, it is
assumed that the formatted document file is in Adobe Systems, Incorporated PostScript, the most common
Page Description Language Format used today. However, the present invention is also applicable to other
document file formatting programs. PostScript describes the document content a page at a time. Simply
put, it specifies the content of a text-line (or text-line fragment such as a phrase, word, or character) and
identifies the location for the text to be displayed. Text location is marked with an x-y coordinate
representing a position on a virtual page. Depending on the resolution used by the software generating
PostScript, the location of the text can be modified by as little as 1/720 inch (1/10 of a printer's "point").
Most laser printers in common use today have somewhat less resolution (e.g. 1/300 inch).

In one embodiment of the present invention method, prior to distribution, the original PostScript
document and the codeword are supplied to an encoder. The encoder reads the codeword, and searches
for the lines which are to be moved. Upon finding a line to be moved, the encoder modifies the original
(unspaced) PostScript file to incorporate the line spacing adjustments. This is done by increasing or
decreasing the y coordinate of the line to be spaced. The encoder output is an "encoded" PostScript
document ready for distribution in either electronic or paper form to a subscriber.
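
A minimal sketch of the y-coordinate adjustment follows. It works on a plain list of text-line y coordinates rather than on a real PostScript file. The function name `shift_lines`, the mapping of a 1 bit to a downward shift, and the default shift size are all assumptions made for illustration, not the encoder of Figure 4.

```python
def shift_lines(y_coords, codeword, delta=0.5):
    """Return new y coordinates, one per text line.

    Bit k of `codeword` (least significant bit first) decides whether
    line k is moved. Assumption: a 1 bit moves the line down by `delta`
    units; in PostScript's default user space y grows upward, so moving
    down means decreasing y. A 0 bit leaves the line where it is.
    """
    shifted = []
    for k, y in enumerate(y_coords):
        bit = (codeword >> k) & 1
        shifted.append(y - delta if bit else y)
    return shifted


# Four lines set 12 units apart; codeword 0101 moves the 1st and 3rd lines.
print(shift_lines([700.0, 688.0, 676.0, 664.0], 0b0101))
# [699.5, 688.0, 675.5, 664.0]
```

A real encoder would also have to locate the positioning operators for each text line in the file, which this sketch leaves out.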

Figure 3 illustrates how a publisher may identify the original recipient (subscriber) of a marked
document by analysis of a recovered paper copy of the document. That is, given a questionable hard copy
51, copy 51 is scanned by scanner 53, analyzed by computer 55, decoded with decoder program 57, and
matched to codebook 59, to determine the source or subscriber version 61. For example, the "decoder"
analyzes the line spacing, and extracts the corresponding codeword, uniquely identifying the original
subscriber.

A page (or pages) of the illicit copy of the document may be electronically scanned to produce a
bitmap image of the page. The bitmap image may preferably be subjected to noise reduction to remove
certain types of extraneous markings (i.e. noise introduced by printing a hard copy, plain paper copying,
electronic scanning, smudges, etc.). The bitmap image may then be rotated to ensure that the text lines are
perpendicular to the side page edge. A "profile" of the page is found; this is the number of ON bits on each
horizontal scan line in the image. The number of such scan lines varies, but in our experiment, the number
of lines is around 40 per text-line. The distance between each pair of adjacent text-line profiles may then be
measured. This is done by one of two approaches: either the distance between the baselines of adjacent
line profiles is measured, or the difference between centroids (i.e. centers of mass) of adjacent line profiles
is measured. The interline spacings are then analyzed to determine if spacing has been added or
subtracted. This process, repeated for every line, determines the codeword of the document; this uniquely
determines the original subscriber.

Advantages of this method relative to the other present invention methods are as follows:
The code can be decoded without the original;
Decoding is quite simple;
It is likely to be the most noise-resistant technique.
However, this method is likely to be the most visible of the coding techniques described herein.

Figure 4 illustrates a simple line spacing encoder pseudocode for PostScript files.
Figure 5 shows a graph of a line spacing profile of a recovered document page. The scan line of the
baseline of each text-line is marked with a "+". The scan line of the centroid of each text-line is also
marked. Decoding a page with line spacing may involve measuring the distance between adjacent text-line
centroids or baselines and determining whether space has been increased, decreased, or left the same
as the standard.

2.2 Feature Enhancement Coding 

This is a present invention coding method that is applied directly to the bitmap image of the document.
The bitmap image is examined for chosen features, and those features are altered or not altered
depending on the codeword. These alterations may include widening or
adding to the features of the individual characters. For example, upward, vertical endlines of letters, that is,
the tops of letters b, d, h, etc., may be extended. These endlines are altered by extending their lengths by
one (or more) pixels, but are not otherwise changed.

This coding is applied upon the bitmap image, and can be detected upon the printer image. With more
emphasized coding than suggested below, or with redundancy in the coding, it may also be detected in
scanned images of printed and photocopied documents.

Advantages of this present invention method are as follows:
There are a very large number of code possibilities (perhaps 10 times more than for word-shift coding
and 20 times more than for line-shift coding);
The coding is performed on the bitmap of the document, thus there is no need for altering the formatted
document file;
This is one of the least visible methods of coding the image.

Disadvantages include:
This is primarily an image coding technique and is not normally applicable to the format file (the other
techniques are more readily applicable to both);
This method may be less applicable to photocopied, or otherwise noisy documents, due to the visibility of
the coding when applied with such magnitude (length) as to also be noise-intolerant;
This code cannot be detected without the original.

The pseudocodes for coding and decoding may be as follows, with reference to Figure 6, which shows
three examples of normal coding 63, 65 and 67, and enhanced coding 73, 75 and 77 of a 5 x 5 pixel array.
CODING:

mask off the least significant codeword bit and right-shift the codeword
for each pixel in image in chosen order (e.g. raster-scan order)
{
    examine k x k (e.g. 5 x 5) neighborhood of pixels around this center pixel
    if the pattern of pixels within the k x k mask matches one of the
        chosen features as in Figure 6
    {
        then if codeword bit is 1, alter feature as in Figure 6
        else if codeword bit is 0, leave feature as is
        store (x,y) location of center pixel and 1 or 0 value of codeword bit
        if codeword = 0, break
        else mask next codeword bit and right-shift
    }
}

DECODING:

read in list of codeword bits and corresponding center pixel locations
    where coding has been performed on original image
set codedImage = 1
for each (x,y) location of coded feature
{
    examine k x k neighborhood of pixels around center pixel location
    if k x k region matches altered pattern and codeword bit is 0,
        or if pixels have not been altered and codeword bit is 1
    then codedImage = 0, break
}
if codedImage = 1, then image matches code
if codedImage = 0, then image does not match code



2.3 Word-Shift Coding

This is a coding method that is applicable to documents with variable spacing between adjacent words.
This encoding is most easily applied to the format file. For each text-line, the largest and smallest spacings
between words are found. To code a line, the largest spacing is decremented by some amount and the
smallest is augmented by the same amount. This maintains the overall text-line length, and produces little
qualitative change on the text image.
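
The spacing adjustment for a single text-line might be sketched as below. The representation of a line as a list of inter-word gap widths, the function name `word_shift_line`, and the choice to return the line unchanged when the requested amount is too large are assumptions made for illustration only.

```python
def word_shift_line(spaces, amount):
    """Shrink the largest inter-word gap and grow the smallest one.

    `spaces` lists the gap widths on one text-line, left to right.
    The two changes cancel, so the overall line length is preserved.
    The line is returned unchanged if it cannot be coded, i.e. if all
    gaps are equal or `amount` exceeds (largest gap - smallest gap).
    """
    if len(spaces) < 2:
        return list(spaces)
    largest = max(range(len(spaces)), key=spaces.__getitem__)
    smallest = min(range(len(spaces)), key=spaces.__getitem__)
    if largest == smallest or amount > spaces[largest] - spaces[smallest]:
        return list(spaces)
    coded = list(spaces)
    coded[largest] -= amount
    coded[smallest] += amount
    return coded


original = [10, 12, 9, 11]
coded = word_shift_line(original, 1)
print(coded)                        # [10, 11, 10, 11]
print(sum(coded) == sum(original))  # True
```

Decoding against the original then amounts to checking whether the gaps at the stored positions differ from those in the unaltered line.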

Advantages of this method relative to other present invention methods are as follows:
It is one of the least visible methods of coding the image.

Disadvantages include:
The code cannot be decoded without the original.

The pseudocode is as below:

CODING:

mask off the least significant codeword bit and right-shift the codeword
for each text-line in the format file
{
    if the code bit is 1
    {
        find the longest space between words
        find the shortest space between words
        shorten longest space and lengthen shortest space by a chosen
            amount (must be <= longest space - shortest space)
        store text-line number, altered space positions, and code bit
        if codeword = 0, break
        else mask next codeword bit and right-shift
    }
}
DECODING:

read in list of codeword bits and corresponding line numbers
    and space locations in text-lines where coding has been
    performed on original image
set codedImage = 1
for each text-line in coded formatted file and original formatted file
{
    if codeword bit for a text-line is 1
    then
        if coded spaces in coded image are not different from
            corresponding spaces in original image
        then codedImage = 0, break
}
if codedImage = 1, then image matches code
else if codedImage = 0, then image does not match code



2.4 Illustrative Review of Altering Techniques 

Figure 7 illustrates an example of line-shift encoding. Note that the second line 83 is shifted down from
first line 81 by approximately 1/150 inch, which equals delta. Due to differential coding, this causes the
spacing between the first line 81 and the second line 83 to be greater than normal and the spacing between
second line 83 and third line 85 to be less than normal. The spaces between third line 85, fourth line 87 and
fifth line 89 are normal.

Figures 8 and 9 illustrate word-shift encoding. Figure 8 shows vertical lines 91, 93, 95, 97, 99, 101, 103,
105 and 107. These lines are vertical guidelines to show the positioning of each of the words in the top and
bottom lines of Figure 8. The word "for" has intentionally been shifted and, therefore, rests at vertical line
99 in the bottom line of text and against vertical line 101 in the top line of text. Figure 9 shows the same
text as in Figure 8, but without the vertical lines to demonstrate that both unshifted and shifted word spacing
appears natural to the untrained eye.

Figures 10, 11 and 12 illustrate feature enhancement encoding. Figure 10 shows characters of text
which have not been altered and would typically represent the standard or reference document. Figure 11
shows the same document but with feature enhancement added. Note, for example, that "1" has been
vertically extended, as have the letters "t", "I", and "d" in the first line, as well as other letters
elsewhere. Figure 12 shows the same feature enhancements as in Figure 11, but with exaggeration
simply to emphasize the enhancement. Based on Figures 11 and 12, an appropriate codeword for the
enhanced feature documents would be 5435 decimal.

3.0 Application of Error Correction 

Due to noise which may be introduced in the recovered document, the identification process is subject
to error. Clever choices of the set of codewords used to space lines (based on Error Correcting Codes) will
be used to minimize the chance of detection error. This will establish a tradeoff between the number of
potential recipients of a document (i.e. the number of codewords) and the probability of correct
identification. To illustrate, the following discussion gives detail of how line-shift decoding may preferably be
enhanced by noise removal.

In general, in the present invention methods, a line-shift decoder extracts a codeword from a (possibly 
degraded) bitmap representation of an encoded document (decoding a recovered, unmodified formatted 
document file is trivial). An illicit copy of an encoded document may be recovered in either electronic or 
paper form. If paper is recovered, a page (or pages) of the document is electronically scanned producing a 
bitmap image of the page(s). Extracting the code from an image file is not as straightforward as doing so 
from the format file. Since the image contains ON and OFF bits, rather than ASCII text and formatting 
commands, pattern recognition techniques must be used first to determine the content. Furthermore, since 
noise may be present, image processing techniques are performed to reduce noise and make the job of 
pattern recognition more robust. Some of the techniques used for document decoding from the image are 
as follows: 

Salt-and-Pepper Noise Removal - Inking irregularities, copier noise, or just dirt on the paper can
cause an image to contain black specks in background areas, and white specks within foreground areas
such as text. Since this noise interferes with subsequent processing, it is desirable to reduce it as much as
possible.

A kFill filter is used, which is designed to reduce salt-and-pepper noise while maintaining document
quality. It does so by discriminating noise from true text features (such as periods and dots) and removing
the noise. It is a conservative filter, erring on the side of maintaining text features versus reducing noise
when those two conflict. It has been described for document clarification and is known by the artisan.
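
For orientation only, the sketch below shows a much simpler filter than kFill: it flips any pixel whose eight neighbors all have the opposite value. It is not the kFill algorithm and, unlike kFill, it would also erase a genuine one-pixel dot. The function name and the list-of-lists image representation (1 = ON, 0 = OFF) are assumptions for illustration.

```python
def remove_isolated_specks(img):
    """Flip every pixel that disagrees with all of its neighbors.

    `img` is a non-empty list of equal-length rows of 0/1 values.
    A lone ON pixel in the background (pepper) becomes OFF, and a
    lone OFF pixel inside the foreground (salt) becomes ON.
    This is a simplified stand-in, not the kFill filter itself.
    """
    height, width = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(height):
        for x in range(width):
            neighbors = [
                img[j][i]
                for j in range(max(0, y - 1), min(height, y + 2))
                for i in range(max(0, x - 1), min(width, x + 2))
                if (j, i) != (y, x)
            ]
            if neighbors and all(v != img[y][x] for v in neighbors):
                out[y][x] = 1 - img[y][x]
    return out


speck = [[0, 0, 0],
         [0, 1, 0],
         [0, 0, 0]]
print(remove_isolated_specks(speck))  # [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
```

Decisions are made on the input image and written to a copy, so removing one speck never affects the test applied to its neighbors.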

Deskewing - Each time that paper documents are photocopied and scanned, the orientation of the text
lines on the page may be changed from horizontal because of misorientation (skewing) of the page. In
addition, a photocopier may also introduce skewing due to the slight non-linearity of its optics. The success
of subsequent processing requires that this skew angle be corrected, that is, that the text lines be returned
to the horizontal in the image file.

One approach for deskewing by use of the document spectrum, or docstrum, technique is a bottom-up
segmentation procedure that begins by grouping characters into words, then words into text lines. The
average angle of the text lines is measured for a page, and if this is non-zero (not horizontal), then the
image is rotated to zero skew angle. Rotation, followed by bilinear interpolation to achieve the final
deskewed image, is a standard digital image processing procedure that can be found in the published
literature.

Text-Line Location - After deskewing, the locations of the text lines can be found. A standard
document processing technique called the projection profile is used. This is simply a summation of the 
ON-valued pixels along each row. For a document whose text lines span horizontally, this profile will have 
peaks whose widths are equal to the character height and valleys whose widths are equal to the white 
space between adjacent text lines. The distances between profile peaks determine interline spacing. 
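
A projection profile and a simple split into text-line bands can be sketched as follows. The bitmap is assumed to be a list of rows of 0/1 values (1 = ON), and the function names are hypothetical. Treating any row with at least one ON pixel as part of a text line is a simplification that presumes noise has already been removed.

```python
def projection_profile(img):
    """Sum the ON-valued pixels along each row of a 0/1 bitmap."""
    return [sum(row) for row in img]


def text_line_spans(profile):
    """Return (first_row, last_row) for each run of non-empty rows.

    Each run is a peak region of the profile (one text line); the
    empty rows between runs are the valleys of white space.
    """
    spans, start = [], None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y                      # a text line begins
        elif count == 0 and start is not None:
            spans.append((start, y - 1))   # the text line ended on the previous row
            start = None
    if start is not None:                  # a text line runs to the last row
        spans.append((start, len(profile) - 1))
    return spans


page = [[0, 0, 0],
        [1, 1, 0],
        [1, 0, 0],
        [0, 0, 0],
        [0, 1, 1],
        [0, 0, 0]]
profile = projection_profile(page)
print(profile)                   # [0, 2, 1, 0, 2, 0]
print(text_line_spans(profile))  # [(1, 2), (4, 4)]
```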
In one preferred embodiment, the present invention line-shift decoder measures the distance between
each pair of adjacent individual text line profiles (within the page profile). This is done by one of two
approaches: either by measuring the distance between the baselines of adjacent line profiles, or by
measuring the difference between centroids of adjacent line profiles, as mentioned above. A baseline is the
logical horizontal line on which characters sit; a centroid is the center of mass of a text line. As seen in
Figure 5 discussed above, each text line produces a distinctive profile with two peaks, corresponding to the
midline and the baseline. The peak in the profile nearest the bottom of each text line is taken to be the
baseline; if equal peak values occur on neighboring scan lines, the largest value scan line is chosen as the
baseline scan line. To define the centroid of a text line precisely, suppose the text line profile runs from
scan line y, y + 1, ..., to y + w, and the respective numbers of ON bits per scan line are h(y), h(y + 1), ...,
h(y + w). Then the text line centroid is given by



\[
\frac{y\,h(y) + \cdots + (y + w)\,h(y + w)}{h(y) + \cdots + h(y + w)}
\]
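
The centroid is a weighted average of scan-line positions, with the ON-bit counts as weights. A direct transcription might look like this; the function name `centroid` and the list-based profile are assumptions for illustration.

```python
def centroid(profile, y, w):
    """Center of mass of the text-line profile on scan lines y .. y + w.

    `profile[r]` is h(r), the number of ON bits on scan line r.
    The band is assumed to contain at least one ON bit.
    """
    rows = range(y, y + w + 1)
    total = sum(profile[r] for r in rows)             # h(y) + ... + h(y + w)
    weighted = sum(r * profile[r] for r in rows)      # y h(y) + ... + (y + w) h(y + w)
    return weighted / total


# A symmetric profile on scan lines 1..3 has its centroid on the middle line.
print(centroid([0, 3, 6, 3, 0], 1, 2))  # 2.0
```

Because every scan line in the band contributes, the centroid generally falls between scan lines.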

The measured interline spacings (i.e. between adjacent centroids or baselines) are used to determine if
white space has been added or subtracted because of a text line shift. This process, repeated for every
line, determines the codeword of the document; this uniquely determines the original recipient.

The decision rules for detection of line shifting in a page with differential encoding are described.
Suppose text lines i - 1 and i + 1 are not shifted and text line i is shifted either up or down. In the
unspaced document, the distances between adjacent baselines, or baseline spacings, are the same. Let s_{i-1}
and s_i be the distances between baselines i - 1 and i, and between baselines i and i + 1, respectively. Then the
decision rule is:

    if s_{i-1} > s_i : decide line i shifted down
    if s_{i-1} < s_i : decide line i shifted up
    otherwise        : uncertain

    Baseline Detection Decision Rule (3.2)
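
Rule (3.2) can be sketched as a small function applied to measured baseline positions. The names `baseline_decision` and `decode_baselines` are hypothetical. The sketch assumes scan-line rows increase down the page and that every other line (0-based indices 1, 3, ...) was moved.

```python
def baseline_decision(s_prev, s_next):
    """Rule (3.2): compare the spacing above and below line i."""
    if s_prev > s_next:
        return "down"
    if s_prev < s_next:
        return "up"
    return "uncertain"


def decode_baselines(baselines):
    """Apply rule (3.2) to each moved line.

    `baselines` holds the baseline scan-line row of each text line,
    top to bottom. Only lines that sit between two unmoved lines
    (indices 1, 3, ...) are examined.
    """
    decisions = []
    for i in range(1, len(baselines) - 1, 2):
        s_prev = baselines[i] - baselines[i - 1]
        s_next = baselines[i + 1] - baselines[i]
        decisions.append(baseline_decision(s_prev, s_next))
    return decisions


# Nominal spacing 40 rows; the 2nd line is one row low, the 4th one row high.
print(decode_baselines([100, 141, 180, 219, 260]))  # ['down', 'up']
```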

Unlike baseline spacings, centroid spacings between adjacent text lines in the original unspaced
document are not necessarily uniformly spaced. In centroid-based detection, the decision is based on the
difference of centroid spacings in the spaced and unspaced documents. More specifically, let s_{i-1} and s_i
be the centroid spacings between lines i - 1 and i, and between lines i and i + 1, respectively, in the
spaced document; let t_{i-1} and t_i be the corresponding centroid spacings in the unspaced document. Then
the decision rule is:

    if s_{i-1} - t_{i-1} > s_i - t_i : decide line i shifted down
    otherwise                        : decide line i shifted up

    Centroid Detection Decision Rule (3.3)
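
Rule (3.3) differs from the baseline rule in two ways: it needs the spacings of the unspaced document, and it has no "uncertain" outcome. A sketch follows; the function name and argument order are assumptions for illustration.

```python
def centroid_decision(s_prev, s_next, t_prev, t_next):
    """Rule (3.3): compare how far each spacing departs from the original.

    s_prev, s_next: centroid spacings above and below line i in the
                    spaced (recovered) document.
    t_prev, t_next: the corresponding spacings in the unspaced document.
    """
    if (s_prev - t_prev) > (s_next - t_next):
        return "down"
    return "up"


# Original centroid spacings are unequal (40.2 and 39.7); the line moved down.
print(centroid_decision(41.1, 38.8, 40.2, 39.7))  # down
```

Since the rule never returns "uncertain", a tie is resolved as a shift up.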

An error is said to occur if the decoder decides that a text line was moved up (down) when it was 
moved down (up). In baseline detection, a second type of error exists. The decoder is uncertain if it cannot 
determine whether a line was moved up or down. Since in the encoding every other line is moved, and this 
information is known to the decoder, false alarms do not occur. 

4.0 Experimental Results 

Two sets of experiments were performed. The first set was designed to test how well line-shift coding
works with different font sizes and different line spacing shifts in the presence of limited, but typical, image
noise. The second set was designed to discover how well a fixed line spacing shift could be detected
as document degradation became increasingly severe. The equipment used in both experiments was as
follows:

1. Ricoh FS1S 400 dpi Flat Bed Electronic Scanner 

2. Apple LaserWriter llntx 300 dpi laser printer 

3. Xerox 5052 plain paper copier. 
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The printer and copier were selected in part because they are typical of the equipment found in wide 
use in office environments. The particular machines used could be characterized as being heavily used but 
well maintained. Xerox and 5052 are trademarks of Xerox Corp. Apple and LaserWriter are trademarks of 
Apple Computer, Inc. Ricoh and FSI are trademarks of Ricoh Corp. 

4.1 Variable Font Size Experiment 

Each experiment in the first set uses a single-spaced page of text in the Times-Roman font. The page
is coded using the differential encoding scheme. In differential encoding, every other line of text in each
paragraph was kept unmoved, starting with the first line of each paragraph. Each line between two unmoved
lines was always moved either up or down. That is, for each paragraph, the 1st, 3rd, 5th, etc. lines were
unmoved, while the 2nd, 4th, etc. lines were moved. Nine experiments were performed using font sizes of
8, 10 or 12 points and shifting alternate lines (within each paragraph) up or down by 1, 2, or 3 pixels. Since
the printer has a 300 dpi resolution, each pixel corresponds to 1/300 inch, or approximately one-quarter
point. Each coded page was printed on the laser printer, then copied three times. The laser printed page
will be referred to as the 0th copy; the $n$th copy, $n \geq 1$, is produced by copying the $(n-1)$st copy. The third
copy was then decoded to extract the codeword. That is, the third copy was electronically scanned, the
bitmap image processed to generate the profile, the profile processed to generate the text line spacings
(both baseline and centroid spacings), and the codeword detected using these measurements and rules
(3.2) and (3.3).
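
The differential encoding step can be illustrated with a short Python sketch. Several details are assumptions made here for illustration and are not taken from the description: the function name `differential_encode`, the mapping of bit 1 to a downward shift and bit 0 to an upward shift, and the convention that vertical positions increase down the page.

```python
def differential_encode(baselines, bits, shift=1):
    """Return shifted vertical line positions for one paragraph.

    baselines: vertical positions of the text lines in pixels, increasing
               down the page (assumed convention).
    bits:      one bit per movable line; 1 = shift down, 0 = shift up
               (assumed mapping).
    shift:     shift size in pixels.
    """
    encoded = list(baselines)
    # Movable lines are the 2nd, 4th, ... lines that sit between two
    # unmoved lines; the 1st, 3rd, 5th, ... lines stay where they are.
    movable = range(1, len(baselines) - 1, 2)
    for index, bit in zip(movable, bits):
        encoded[index] += shift if bit else -shift
    return encoded


# Five lines spaced 40 pixels apart carry two bits.
print(differential_encode([0, 40, 80, 120, 160], [1, 0]))
```

In this sketch a paragraph of five lines carries two bits, one for each line that lies between two unmoved neighbors.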

Figure 13 presents the results of the variable font size experiment for one page of single-spaced text. 
Note that as the font size decreases, more lines can be placed on the page, permitting more information to 
be encoded. Both baseline and centroid approaches detected without error for spacings of at least 2 pixels; 
the centroid approach also had no errors for a 1 pixel spacing. 

Though it is not shown in Figure 13, it is noteworthy that some variability will occur in the detection
performance results, even in repeated "decoding" of the same recovered page. This variability is due in
part to randomness introduced in electronic scanning. If a page is scanned several times, different skew
angles will ordinarily occur. The skew will be corrected slightly differently in each case, causing detection
results to vary.

To illustrate this phenomenon, the test case (8 point text, 1 pixel spacing) was rescanned 3 additional
times. The initial text line skew angle (i.e. before deskewing) differed for each scan. In the three rescans,
the following decoding results were observed under baseline detection: 5 uncertain; 3 uncertain and 1 error;
and 6 uncertain. Curiously, the line spacings that could not be detected or were in error varied somewhat
across the retries. This suggests that there may be some decoding performance gained by scanning a
single page multiple times, and combining the results (e.g. averaging).
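
One way to combine several scans, sketched here in Python, is to average each line spacing across the scans before applying a decision rule. This was not tested in the experiments; the function name `average_spacings` and the input layout (one list of spacings per scan) are assumptions for illustration.

```python
from statistics import mean


def average_spacings(scans):
    """Average per-line spacings measured on several scans of one page.

    scans: a list of scans, each a list of spacings in scan lines, with
           every scan listing the same text lines in the same order.
    """
    if len({len(scan) for scan in scans}) != 1:
        raise ValueError("need at least one scan, all of equal length")
    # zip(*scans) groups the measurements of the same spacing together.
    return [mean(column) for column in zip(*scans)]


# Three scans of the same two spacings; the averages are no longer
# restricted to integer values.
print(average_spacings([[41, 39], [40, 39], [41, 40]]))
```

Averaging turns integer-valued baseline spacings into real-valued estimates, which may reduce the number of uncertain decisions.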

4.2 Plain Paper Copying Experiment 

For the second set of experiments, a single-spaced page of text was coded using differential encoding. 
The font was fixed to be Times-Roman, font size to be 10 point, and the coding line-shift to be 1 pixel. 
Repeated copies (the 0th, 1st, ..., 10th copy) of the page were then made, and each copy used in a separate
experiment. Hence, each successive experiment used a slightly more degraded version of the same text 
page. The experimental results are tabulated in Figure 14. 

No errors were observed through the 10th recursive copy using centroid detection. What is even more 
remarkable is that less than half the available signal to noise "margin" has been exhausted by the 10th 
copy. This suggests that many more copies would likely be required to produce even a single error; such a
document would be illegible!

Figure 14 shows that, for baseline decoding, detection errors and uncertainties do not increase 
monotonically with the number of copies. Further, the line spacings that could not be detected correctly 
varied somewhat from copy to copy. This suggests that line spacing "information" is still present in the text 
baselines, and can perhaps be made available with some additional processing. 

The results of Figure 14 report the uncoded error performance of our marking scheme. But the 21 line 
shifts used in the experiment were not chosen arbitrarily. The codeword comprised 3 concatenated 
codewords selected from a Hamming block code, a 1-error correcting code. Hence, roughly each 1/3 page
was protected from 1 error. Many, but not all, of the errors and uncertainties resulting from baseline 
decoding would have been corrected by this encoding. However, since uncoded centroid detection 
performed so well, it is unclear whether there is any need to supplement it with error correction. 
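
For illustration, a single-error-correcting Hamming code of length 7 can be sketched in Python as follows. The (7,4) parameters are an assumption inferred from 21 line shifts divided into 3 codewords; the bit ordering and the function names `hamming74_encode` and `hamming74_decode` are likewise assumptions and may differ from the code used in the experiments.

```python
def hamming74_encode(data):
    """Encode 4 data bits into a 7-bit codeword (parity at positions 1, 2, 4)."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(code):
    """Correct up to one flipped bit and return the 4 data bits."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3   # 1-based error position, 0 if none
    if position:
        c[position - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


# One wrongly detected line shift (bit 5 flipped) is corrected.
word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1
print(hamming74_decode(word))
```

Three such codewords fill 21 line shifts and carry 12 data bits, with one detection error tolerated in each third of the page.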
5.0 Discussion and Implications of Image Defects

[...] information in the relative rather than absolute shifts [...] vertical displacements and
is not dramatically affected by this particular image degradation.

5.1 An Analytical Noise Model

[...] be described by the centroid spacings [...]. It is assumed that the $v_i$ [...] are
independent Gaussian random variables. This assumption is supported by the measurements, which yield
a mean of $\mu_1 = 0.0528$ pixel and variance of $\delta_1^2 = 0.140$ pixel$^2$ [...].

[...] centroid spacings are denoted by $s_i^j$ [...] by copying the $(j-1)$st copy.
Hence, the centroid spacing $s_i^j$ is corrupted by the total noise:

$s_i^j = s_i + (N_i^1 + \cdots + N_i^j)$   (4.4)

The measurements taken suggest a surprisingly simple statistical behavior for the random copier noise. The
noise components $N_i^j$, $j = 1, 2, \ldots, K$, are well modeled by Gaussian random variables with mean $\mu = 0.066$
pixel and variance $\delta^2 = 0.017$ pixel$^2$. The measurements suggest that the random variables $N_i^1, \ldots, N_i^j$ are
also uncorrelated, and by normality, they are thus independent. Hence, the centroid spacing $s_i^j$ on the $j$th
copy is

$s_i^j = s_i + \eta_i^j$   (4.5)

where $\eta_i^j$ is Gaussian with mean $j\mu$ and variance $j\delta^2$.

Printer noise and copier noise are now combined to estimate the error probability under centroid
detection. Consider three adjacent, differentially encoded text lines labeled such that lines $i-1$ and $i+1$
are unshifted while line $i$ is shifted (up or down) by $c$ pixels. Let $t_{i-1}$ and $t_i$ be the centroid spacings
between these lines in the original unspaced document, and let $s_{i-1}$ and $s_i$ be the corresponding spacings
on the 0th copy of the encoded document. Then

$s_{i-1} = t_{i-1} + c + v_{i-1}$,   (4.6)

$s_i = t_i - c + v_i$,   (4.7)

where $c = +1$ if line $i$ is shifted down and $c = -1$ if line $i$ is shifted up. Let $s_{i-1}^j$ and $s_i^j$ be the
corresponding centroid spacings on the $j$th copy of the document. Then

$s_{i-1}^j = t_{i-1} + c + v_{i-1} + \eta_{i-1}^j$,   (4.8)

$s_i^j = t_i - c + v_i + \eta_i^j$,   (4.9)

where the $\eta_i^j$ are defined in (4.5).

Suppose the $j$th copy of the document is recovered and is to be decoded. Applying (4.8) and
(4.9) to the detection rule (3.3):

if $v_{i-1} - v_i > \eta_i^j - \eta_{i-1}^j - 2c$ : decide line shifted down   (4.10)
otherwise : decide line shifted up

Since the random variables $v_{i-1}$, $v_i$ and $\eta_{i-1}^j$, $\eta_i^j$ are mutually independent, the decision variable

$D = (v_{i-1} - v_i) - (\eta_i^j - \eta_{i-1}^j)$

is Gaussian with zero mean and variance $2(\delta_1^2 + j\delta^2)$. Hence, the probability that a given line is decoded in
error is

$p(D > -2c \mid \text{up shift}) = p(D \leq -2c \mid \text{down shift}) = p(D \leq -2)$.   (4.11)

The error probability is easily evaluated using the complementary error function. Using the measurements
$\delta_1^2 = 0.140$ and $\delta^2 = 0.017$, the error probability is only approximately 2% on the 20th copy.
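
The 2% figure can be checked numerically. The Python sketch below assumes the model stated above: a zero-mean Gaussian decision variable with variance equal to twice the sum of the printer noise variance (0.140) and the copy number times the copier noise variance (0.017), and a decoding error whenever that variable falls more than 2 pixels on the wrong side of zero. The function name `centroid_error_probability` and the constant names are introduced here for illustration.

```python
import math

PRINTER_VAR = 0.140   # printer noise variance, pixel^2 (measured value above)
COPIER_VAR = 0.017    # copier noise variance per copy generation, pixel^2


def centroid_error_probability(copies):
    """Probability that a 1 pixel line shift is decoded in the wrong direction.

    The decision variable is Gaussian with zero mean and variance
    2 * (PRINTER_VAR + copies * COPIER_VAR); an error occurs when it is
    at or below -2.
    """
    variance = 2.0 * (PRINTER_VAR + copies * COPIER_VAR)
    # For a zero-mean Gaussian, P(D <= -x) = 0.5 * erfc(x / sqrt(2 * variance)).
    return 0.5 * math.erfc(2.0 / math.sqrt(2.0 * variance))


for j in (0, 10, 20):
    print(j, centroid_error_probability(j))
```

For the 20th copy the variance is 2(0.140 + 20 x 0.017) = 0.96 pixel^2, so the threshold of 2 pixels lies about 2.04 standard deviations from the mean.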


5.2 Comparison of Baseline and Centroid Detection Algorithms 

Detection using either the baseline or centroid of a text line profile offers distinct advantages and
disadvantages. As expected, the experimental results reveal that centroid-based detection outperforms
baseline-based detection for pages encoded with small line shifts (i.e. 1 pixel) and subject to large
distortion. This performance difference arises largely because baseline locations are integer valued, while
centroid locations, being averages, are real valued. Recall that baseline locations are determined by
detection of a peak in the text line profile. Sometimes this peak is not pronounced; the profile values on scan
lines neighboring the baseline are often near the peak value. Hence, relatively little noise can cause the
peak to shift to a neighboring scan line. A single scan line shift is sufficient to introduce a detection error
when text lines are encoded with a 1 pixel shift.
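
The difference between the two measurements can be seen in a small Python sketch. It assumes a text line profile is given as a list of per-scan-line pixel counts; the function names and the sample profiles are invented for illustration and are not measured data.

```python
def baseline_index(profile):
    """Scan line holding the profile peak (an integer)."""
    return max(range(len(profile)), key=lambda row: profile[row])


def centroid_position(profile):
    """Center of mass of the profile along the vertical axis (a real number)."""
    total = sum(profile)
    if total == 0:
        raise ValueError("profile contains no marked pixels")
    return sum(row * value for row, value in enumerate(profile)) / total


# Two profiles of the same text line; a single pixel of noise moves the
# peak from scan line 3 to scan line 4.
clean = [2, 5, 9, 30, 29, 3]
noisy = [2, 5, 9, 29, 30, 3]
print(baseline_index(clean), baseline_index(noisy))
print(centroid_position(clean), centroid_position(noisy))
```

In this made-up example the detected baseline jumps by a full scan line, while the centroid moves by only about 0.01 scan line.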

It also appears likely that centroids are less subject to certain imaging defects than are baselines. 
Baselines appear relatively vulnerable to line skew (or more precisely, the noise introduced by deskewing).
Though centroid detection outperforms baseline detection, the latter has other benefits. In particular, 
encoded documents can be decoded without reference to the original, unspaced document. A secure 
document distributor would then be relieved of the need to maintain a library of original document centroid 
spacings for decoding. 

Finally, both detection techniques can be used jointly (and indeed, with other techniques) to provide a 
particularly robust, low error probability detection scheme. 



6.0 Conclusion 

Making and distributing illegitimate copies of documents can be discouraged if each of the original
copies is unique, and can be associated with a particular recipient. Several techniques for making text
documents unique have been described. One of these techniques, based on text line shifting, has been
implemented as a set of experiments to demonstrate that perturbations in line spacing that are small
enough to be indiscernible to a casual reader can be recovered from a paper copy of the document, even
after being copied several times.

In the experiments, the position of the odd numbered lines within each paragraph remains the same
while the even numbered lines are moved up or down by a small amount. By selecting different line shifts,
information is encoded into the document. If the document remains in the electronic form throughout the
experiment, retrieving the encoded information is trivial. To retrieve the information from a paper copy, the
document is scanned back into the computer. Two detection methods have been considered, one based on
the location of the bottom of the characters on each line, and the other based on the center of mass of each
line. The advantage of using the baselines is that they are equally spaced before encoding and the
information can be retrieved without reference to a template. The centers of mass of the lines are not
equally spaced; however, this technique has been found to be more resilient to the types of distortion
encountered in the printing and copying process.

The differential encoding mechanism has been selected because the types of distortion that have been
encountered have canceled out when differences between adjacent lines are considered. In the experiments,
the lines in the document are moved up or down by as little as 1/300 inch, the document is copied
as many as ten times, then the document is scanned into a computer and decoded. For the set of
experiments that have been conducted, the centroid decoding mechanism has provided an immeasurably
small error rate.

Obviously, numerous modifications and variations of the present invention are possible in light of the 
above teachings. It is therefore understood that within the scope of the appended claims, the invention may 
be practiced otherwise than as specifically described herein.


Claims 

1. A method of deterring the illicit copying of electronically published documents, which comprises: 

(a) utilizing a computer system to electronically publish a plurality of copies of a document having
electronically created material thereon for distribution to a plurality of subscribers;

(b) operating programming within said computer system so as to perform the following steps:

(i) encoding said plurality of copies each with a separate, unique identification code, said
identification code being based on a unique arrangement of the electronically created material on
each such copy; and,

(ii) creating a codebook to correlate each such identification code to a particular subscriber.

2. The method of claim 1 wherein said printed material is lines of textual material and said unique 
arrangement of the electronically created material is based on line-shift coding. 

3. The method of claim 2 wherein said line-shift coding is accomplished by altering a document format file
to shift locations of at least one line relative to other lines contained in the document to uniquely
encode each copy of the document.

4. The method of claim 2 wherein said line-shift coding is accomplished by altering a document bitmap
image to shift locations of at least one line relative to other lines contained in the document to uniquely 
encode each copy of the document. 

5. The method of claim 1 wherein said electronically created material includes words arranged in a
predetermined sequence and said unique arrangement of the material is based on word-shift coding.

6. The method of claim 5 wherein said word-shift coding is accomplished by altering a document format
file to shift locations of at least one word relative to other words contained in the document to uniquely
encode each copy of the document.

7. The method of claim 5 wherein said word-shift coding is accomplished by altering a document bitmap 
image to shift locations of at least one word relative to other words contained in the document to 
uniquely encode each copy of the document. 


8. The method of claim 1 wherein said electronically created material includes standardized print features 
and said unique arrangement of the material is based on feature-altered coding. 

9. The method of claim 8 wherein said feature-altered coding is accomplished by altering a document
bitmap image to alter at least one print feature relative to said standardized print features.

10. The method of claim 1, wherein operating said programming also includes performing the steps of: 

(iii) creating a first copy of said document as a standard document; 

(iv) creating a plurality of subsequent copies, each with at least one alteration rendering it different
from said standard document, and each being different from one another so that each of said copies
has a unique identification code based on said at least one alteration;

(v) comparing each subsequent copy with said standard document to identify a sequence of same 
and different aspects of each such copy relative to said standard document; and, 

(vi) converting said sequence of same and different aspects to a unique binary identification code. 


11. The method of claim 1, wherein said computer system includes a scanner device connected thereto 
and operating said programming includes performing the steps of: 

(1) scanning a copy of a document to feed its image into said computer system; 

(2) analyzing and decoding the image to determine its unique identification code; and, 

(3) comparing the resulting identification code to the codebook to determine the particular subscriber
to which the identification code correlates.

12. The method of claim 1 wherein operating said programming includes performing the steps of: 

(1) receiving a bitmap image for a copy of a previously electronically published document; 
(2) analyzing the bitmap image to determine its unique identification code; and,

(3) comparing the resulting identification code to the codebook to determine the particular subscriber 
to which the identification code correlates. 

13. The method of claim 1, wherein operating said programming includes performing the steps of:

(1) receiving a document format file for a copy of a previously electronically published document;

(2) analyzing the document format file to determine its unique identification code; and

(3) comparing the resulting identification code to the codebook to determine the particular subscriber 
to which the identification code correlates. 

14. The method of claim 11 wherein operating said programming also includes the step of noise reduction.

15. The method of claim 14 wherein operating said programming also includes the step of noise reduction. 

16. The method of claim 14 wherein said analyzing and decoding is based on baseline differential
determinations, or on centroid differential determinations.

17. The method of claim 11, 12 or 13 wherein said copy of a document is from a document which has been
encoded by word-shift alteration, or by feature enhancement alteration, or by line-shift alteration.

FIG. 4

FIG. 6

FIG. 7

This is a method of altering a document by vertically 
shifting the location of text lines to uniquely encode the 
document. The encoding is most easily applied to the 
format file. The embedded code word may be decoded 
from the format file or bitmap.

FIG. 10

EP 0 660 275 A2 



FIG. 13

FIG. 15